Putting Visual Analytics into Practical Use
We are required to attempt one of the three questions under Challenge 3 of VAST Challenge 2022 on the economic of the city of Engagement, Ohio USA by using appropriate static and interactive statistical graphics methods.
With that, I have chosen #3 which is to:
Describe the health of the various employers within the city limits. What employment patterns do you observe? Do you notice any areas of particularly high or low turnover? Limit your response to 10 images and 500 words.
About 1000 representative residents have volunteered to provide data using the city’s urban planning app, which records the places they visit, their spending, and their purchases, among other things; totaling to 3 dataset folders (Activity Logs, Attributes, Journals). For this exercise, we refers to Empolyers that offers jobs as stipulated in the Jobs dataset; we thus selected the dataset Employers, Buildings, Jobs & Participants from the Attributes folder, and CheckinJournal from the Journals folder which give details on the employers, employees and jobs.
The packages tidyverse (including dplyr, ggplot2, patchwork), ggrepel, lubridate, gapminder, gganimate, ggiraph etc will be used for the purpose of this exercise:
The code chunk below is used to install and load the required packages onto RStudio.
packages = c('tidyverse','treemap','ggrepel','lubridate','gapminder','gganimate','ggiraph','plotly','zoo')
for(p in packages){
if(!require(p, character.only =T)){
install.packages(p)
}
library(p, character.only =T)
}
The code chuck below import Employers.csv,
Buildings.csv, Jobs.csv, CheckinJournal.csv
and Participants.csv from the data folder into R by using
read_csv() and save it as an tibble data frame.
Employers <- read_csv("data/Employers.csv")
Buildings <- read_csv("data/Buildings.csv")
Jobs <- read_csv("data/Jobs.csv")
Checkin <- read_csv("data/CheckinJournal.csv")
Participants <- read_csv("data/Participants.csv")
First, let’s get a general sense of our data using the function
summary().
summary(Employers)
employerId location buildingId
Min. : 379 Length:253 Min. : 3.0
1st Qu.: 829 Class :character 1st Qu.: 261.0
Median :1279 Mode :character Median : 486.0
Mean :1089 Mean : 517.8
3rd Qu.:1734 3rd Qu.: 782.0
Max. :1797 Max. :1041.0
summary(Buildings)
buildingId location buildingType
Min. : 1.0 Length:1042 Length:1042
1st Qu.: 261.2 Class :character Class :character
Median : 521.5 Mode :character Mode :character
Mean : 521.5
3rd Qu.: 781.8
Max. :1042.0
maxOccupancy units
Min. : 1.00 Length:1042
1st Qu.: 5.00 Class :character
Median : 7.00 Mode :character
Mean : 15.51
3rd Qu.: 12.00
Max. :418.00
NA's :539
summary(Jobs)
jobId employerId hourlyRate startTime
Min. : 0.0 Min. : 379 Min. : 10.00 Length:1328
1st Qu.: 331.8 1st Qu.: 438 1st Qu.: 10.03 Class1:hms
Median : 663.5 Median : 884 Median : 14.74 Class2:difftime
Mean : 663.5 Mean :1059 Mean : 19.13 Mode :numeric
3rd Qu.: 995.2 3rd Qu.:1337 3rd Qu.: 23.32
Max. :1327.0 Max. :1797 Max. :100.00
endTime daysToWork educationRequirement
Length:1328 Length:1328 Length:1328
Class1:hms Class :character Class :character
Class2:difftime Mode :character Mode :character
Mode :numeric
summary(Checkin)
participantId timestamp venueId
Min. : 0.0 Min. :2022-03-01 05:35:00 Min. : 0
1st Qu.: 221.0 1st Qu.:2022-06-10 18:30:00 1st Qu.: 449
Median : 464.0 Median :2022-10-03 20:25:00 Median : 910
Mean : 480.5 Mean :2022-10-05 07:41:29 Mean :1015
3rd Qu.: 726.0 3rd Qu.:2023-01-28 08:10:00 3rd Qu.:1358
Max. :1010.0 Max. :2023-05-25 00:05:00 Max. :1805
venueType
Length:2100635
Class :character
Mode :character
Employers dataset is joined with Buildings dataset
based on buildingId to filter only relevant information
from Buildings dataset in regards to employers. Left join on
Employers is used as commercial buildings are a subset of the
different type of buildings.
Employers <- Employers %>% left_join(Buildings,by="buildingId")
Next, we use outer join on Employers and Jobs based
on employerId to have a full overview on all the jobs that
are offered by each employer.
Employers = merge(x=Employers,y=Jobs,by="employerId",all=TRUE)
First, we start by renaming the columns and values of in
Employers and Checkin dataset using the function rename(),
and sub()
for a better format and ease of reading.
Note: A check between both datasets shows that venueId
in Checkin dataset refers to the employerId,
pubId etc. For the purpose of this exercise, we are only
interested in the employerId (venueType =
Workspace) and other venues type will be removed
subsequently.
# rename columns
Employers <- Employers %>%
rename('Employer_ID' = 'employerId',
'Location(Pt)' = 'location.x',
'Location(Area)' = 'location.y',
'Building_ID' = 'buildingId',
'Building_Type' = 'buildingType',
'Max_Occupancy' = 'maxOccupancy',
'Units' = 'units',
'Job_ID' = 'jobId',
'Hourly_Rate' = 'hourlyRate',
'Start_Time' = 'startTime',
'End_Time' = 'endTime',
'Days_To_Work' = 'daysToWork',
'Education_Level' = 'educationRequirement')
Checkin <- Checkin %>%
rename('Participant_ID' = 'participantId',
'Timestamp' = 'timestamp',
'Employer_ID' = 'venueId',
'Venue_Type' = 'venueType')
Jobs <- Jobs %>%
rename('Education_Level' = 'educationRequirement')
Participants <- Participants %>%
rename('Participant_ID' = 'participantId',
'Household_Size' = 'householdSize',
'Have_Kids' = 'haveKids',
'Age' = 'age',
'Education_Level' = 'educationLevel',
'Interest_Group' = 'interestGroup',
'Joviality' = 'joviality')
#rename row values
Employers$Education_Level <- sub('HighSchoolOrCollege',
'High School or College',
Employers$Education_Level)
Jobs$Education_Level <- sub('HighSchoolOrCollege',
'High School or College',
Jobs$Education_Level)
Participants$Education_Level <- sub('HighSchoolOrCollege',
'High School or College',
Participants$Education_Level)
We are using Checkin to see the changes in employment
(i.e. checkin by participants at workplace) over time. We see that the
venueId column in the dataset are IDs of all possible
venues such as work place, restaurants and pubs.
Given that we are only interested in workplace, we will first filter
out row that reads “workplace” using grep().
We then assign a running week number and also day to day percentage change in number of employees for each employer.
#Extract the date from timestamp
Checkin$Date <- as.Date(Checkin$Timestamp)
#Filter rows with workplace as value
Workplace_Checkin <- Checkin[grep("Workplace", Checkin$Venue_Type),]
#Assign Running Week Number
Workplace_Checkin <- Workplace_Checkin %>%
mutate(Week_Num = as.double(ceiling(difftime(Workplace_Checkin$Date, strptime("01.03.2022", format = "%d.%m.%Y"),units="weeks"))))
#Compute no. of employees that report to work during that week (5day work week)
Count_Checkin <- Workplace_Checkin %>%
group_by(Week_Num, Employer_ID) %>%
summarise('Num_of_Employees'= n_distinct(Participant_ID)) %>%
ungroup()
#Calculate Percentage Change
Count_Checkin <- Count_Checkin %>%
group_by(Employer_ID) %>%
mutate(
Perc_Chg = round((Num_of_Employees - lag(Num_of_Employees))/lag(Num_of_Employees)*100,2)
)
Next, we put up a series of charts to address the questions.
We see the absolute and percentage change in number of employee for each employer over each work week
Count_Checkin$Employer_ID <- as.character(Count_Checkin$Employer_ID)
p<- ggplot(Count_Checkin, aes(x=Week_Num, y=Num_of_Employees, group=Employer_ID)) +
geom_line(aes(color=Employer_ID),show.legend = FALSE)+
ylim(1,15) +
ggtitle(label = "Number of Employees For Each Work Week")+
xlab("Week Number from Mar 22 to May 23") +
ylab("No. of Employees") +
theme_minimal()+
theme(plot.title = element_text(size=12, face="bold",hjust = 0.5))
ggplotly(p)
Most of the manpower movement happens on the first two weeks where alot employement dropped, noting sharp decline for Employer 1763 and 868; which started off with 14 and 15 in Week 1 and then dropped to 6 employees in Week 2.
p<- ggplot(Count_Checkin, aes(x=Week_Num, y=Perc_Chg,group=Employer_ID)) +
geom_line(aes(color=Employer_ID), show.legend = FALSE)+
ggtitle(label = "Percentage % Change of Employees For Each Work Week")+
xlab("Week Number from Mar 22 to May 23") +
ylab("% Change") +
theme_minimal()+
theme(plot.title = element_text(size=12, face="bold",hjust = 0.5))
ggplotly(p)
Similarly, most of the decline happened in the first two weeks and remained stabled mostly throughout untill the last week.
Next, we want to know the age profile of employees for each employer. We grouped them into various age groups for easier readability.
We use left join to find out the employer for participants and the respective age group they are in.
Participants_AgeGrp <- Participants %>%
select(Participant_ID, Age, Education_Level)
Workplace_Checkin<-
left_join(Workplace_Checkin,Participants_AgeGrp,by = 'Participant_ID')
We now use raincloud plot to see the wage distribution for different education level.
Jobs$Education_Level = factor(Jobs$Education_Level, levels = c('Low', 'High School or College', 'Bachelors','Graduate'))
p<- ggplot(Jobs, aes(x = Education_Level, y = hourlyRate, fill=Education_Level)) +
ggdist::stat_halfeye(
adjust = .5,
width = .6,
.width = 0,
justification = -.3,
point_colour = NA) +
geom_boxplot(
width = .25,
outlier.shape = NA
) +
geom_point(
size = 1.3,
alpha = .3,
position = position_jitter(
seed = 1, width = .1
)
) +
coord_cartesian(xlim = c(1.2, NA), clip = "off")+
ggtitle(label = "Wage Distribution for Different Education Level",
subtitle = "High Wages For Higher Educated")+
theme_minimal()+
theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
plot.subtitle = element_text(size=12,hjust = 0.5,color='mediumvioletred'))+
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'grey'))
ggplotly(p)
We see that indeed higher education are paid better.
We use a heatmap with treepmap to see the age profile
for each employer.
d3tree(tm, rootname = "Age and No. of Employees under Each Employer" )